pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, ggplot2)
Take home 3 - question 1
Junseok Kim
Jun 17, 2023
Through visual analytics, FishEye aims to identify companies potentially engaged in illegal fishing and protect marine species affected by it.
In this context, this page will attempt to answer the following task under Mini-Challenge 3 of the VAST Challenge: Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
# A tibble: 27,622 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Comp… 310612303. Automobiles
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,…
3 Aqua Advancements Sashimi SE Expr… Oceanus Comp… 115004667. Holding firm wh…
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric …
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
7 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
8 Assam Limited Liability Company Utopor… Comp… 72162317. Power and Gas s…
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia…
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr…
# ℹ 27,612 more rows
21515 missing from revenue_omu column
# A tibble: 2,595 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Smith Ltd ZH Company NA Unknown
2 Williams LLC ZH Company NA Unknown
3 Garcia Inc ZH Company NA Unknown
4 Walker and Sons ZH Company NA Unknown
5 Walker and Sons ZH Company NA Unknown
6 Smith LLC ZH Company NA Unknown
7 Smith Ltd ZH Company NA Unknown
8 Romero Inc ZH Company NA Unknown
9 Niger River Marine life Oceanus Company NA Unknown
10 Coastal Crusaders AS Industrial Oceanus Company NA Unknown
# ℹ 2,585 more rows
There are 2595 dupe entries
# A tibble: 25,027 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Jones LLC ZH Comp… 310612303. Automobiles
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,…
3 Aqua Advancements Sashimi SE Expr… Oceanus Comp… 115004667. Holding firm wh…
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric …
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
7 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
8 Assam Limited Liability Company Utopor… Comp… 72162317. Power and Gas s…
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia…
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr…
# ℹ 25,017 more rows
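The code behind the missing-value count and the duplicate check is not shown on this page. A minimal sketch of how such checks could be done with dplyr is given below; `nodes_toy` and its values are made up for illustration and are not taken from the challenge data.

```r
library(dplyr)

# Hypothetical stand-in for the node table (values are invented)
nodes_toy <- tibble(
  id          = c("A Ltd", "B Inc", "B Inc", "C LLC"),
  country     = c("ZH", "Oceanus", "Oceanus", "ZH"),
  revenue_omu = c(1000, NA, NA, NA)
)

# Number of missing values in the revenue column
n_missing <- sum(is.na(nodes_toy$revenue_omu))

# Rows that exactly repeat an earlier row
dupes <- nodes_toy[duplicated(nodes_toy), ]

# Keep a single copy of every row
nodes_unique <- distinct(nodes_toy)
```

With this approach the number of rows in `nodes_unique` equals the original row count minus the number of rows in `dupes`, which matches the relationship between the three tables printed above (27,622 - 2,595 = 25,027).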
| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
mc3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  # size is mapped to the data; colour and alpha are fixed values, so they sit outside aes()
  geom_node_point(aes(size = betweenness_centrality),
                  colour = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph() +
  theme(text = element_text(family = "sans"))
# A tibble: 10 × 2
Products Occurrences
<chr> <int>
1 character(0) 16395
2 Unknown 4614
3 Fish and seafood products 63
4 Seafood products 55
5 Fish and fish products 31
6 Food products 31
7 Canning, processing and manufacturing of seafood and other aquat… 23
8 Footwear 21
9 Seafood 20
10 Grocery products 19
character(0) and Unknown will be cleaned later.
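The cleaning step itself is not shown here. One possible way to do it, sketched on a made-up tibble (`products_toy` is a hypothetical name), is to turn both placeholder strings into `NA` so that they are treated as missing values downstream.

```r
library(dplyr)

# Hypothetical sample of the product_services column
products_toy <- tibble(
  product_services = c("character(0)", "Unknown", "Seafood products")
)

# Replace the two placeholder strings with NA and leave everything else untouched
products_clean <- products_toy %>%
  mutate(product_services = if_else(
    product_services %in% c("character(0)", "Unknown"),
    NA_character_,
    product_services
  ))
```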
| Name | mc3_nodes1 |
| Number of rows | 37324 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 6 | 700 | 0 | 34121 | 0 |
| country | 29241 | 0.22 | 2 | 14 | 0 | 78 | 0 |
| type | 29241 | 0.22 | 7 | 16 | 0 | 3 | 0 |
| product_services | 29241 | 0.22 | 4 | 1737 | 0 | 1844 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 34139 | 0.09 | 939014 | 12435469 | 3652.23 | 8261.03 | 16966.67 | 48266.67 | 310612303 | ▇▁▁▁▁ |
stopwords_removed <- token_nodes %>%
  anti_join(stop_words, by = "word") %>%
  filter(!is.na(word))
# Plot the 15 most frequent words
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  geom_text(aes(label = n), vjust = 0.5, hjust = -0.1, size = 2, color = "black") +
  coord_flip() +
  # coord_flip() draws x (word) on the vertical axis, so the labels follow the aesthetics
  labs(x = "Unique words",
       y = "Count",
       title = "Count of Top 15 unique words found in product_services field") +
  theme_minimal()
[1] 5467
[1] 5688
The result looks much neater now: roughly 200 unique words were removed as stop words. However, there are still a lot of words not related to fishery.
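The construction of `token_nodes` is not shown on this page. A minimal sketch of how such a token table could be built with tidytext follows; `nodes_toy` and `token_toy` are hypothetical names, and the two descriptions are invented.

```r
library(dplyr)
library(tidytext)

# Hypothetical node descriptions
nodes_toy <- tibble(
  id = c("A Ltd", "B Inc"),
  product_services = c("Fish and seafood products", "Canning of the tuna")
)

# One row per lower-cased word; the id column is carried along
token_toy <- nodes_toy %>%
  unnest_tokens(word, product_services)

# Drop the common English stop words that ship with tidytext
tokens_kept <- token_toy %>%
  anti_join(stop_words, by = "word")
```

Comparing `n_distinct(token_toy$word)` with `n_distinct(tokens_kept$word)` would give the kind of before and after counts of unique words discussed above.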
words_fishery <- c("fish", "seafood", "frozen", "food", "fresh", "salmon",
                   "shrimp", "shellfish", "sea", "squid", "water", "seafoods",
                   "foods", "marine", "shipment", "shipping", "pier", "carp",
                   "cod", "herring", "lichen", "mackerel", "pollock", "shark",
                   "tuna", "ocean", "oyster", "clam", "lobster", "crab",
                   "crustaceans", "crustacean", "bass")
mc3_nodes_fishery <- mc3_nodes_unique %>%
  filter(str_detect(product_services, paste(words_fishery, collapse = "|")) |
           is.na(product_services))
print(mc3_nodes_fishery)
# A tibble: 1,534 × 5
id country type revenue_omu product_services
<chr> <chr> <chr> <dbl> <chr>
1 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca…
2 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm…
3 Punjab s Marine conservation Riodel… Comp… 72167572. Beef, pork, chi…
4 Fisher Group ZH Comp… 29981457. Steel (marketin…
5 Morales, Young and Taylor ZH Comp… 23739782. Processed foods…
6 Morgan LLC ZH Comp… 17939781. Animal feed, an…
7 Neptune's Harvest LC Transport Riodel… Comp… 8726579. Frozen whole an…
8 Victoria Falls Limited Liabilit… Rio Is… Comp… 8014806. Domestic and in…
9 Caracola del Mar NV Family Rio Is… Comp… 7085566. Canned, frozen …
10 The Sea Lion NV Marine biology Oceanus Comp… 6264744. One- to five-da…
# ℹ 1,524 more rows
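Because the keywords are joined into a plain, case-sensitive substring pattern, a short keyword such as "sea" can match inside unrelated words, and capitalised terms can be missed. The sketch below compares that behaviour with a word-boundary, case-insensitive pattern. It is only an illustration on invented descriptions, not the filter used in this analysis; `words_toy`, `descriptions`, and `pattern_toy` are hypothetical names.

```r
library(stringr)

# Small hypothetical keyword list and descriptions
words_toy <- c("fish", "sea", "cod", "seafood")
descriptions <- c("Research and decoding services",
                  "Fresh Cod and sea bass",
                  "Seafood")

# Plain substring match, as in the filter above (case-sensitive)
substring_hits <- str_detect(descriptions, str_c(words_toy, collapse = "|"))

# Whole-word match that ignores case
pattern_toy <- str_c("\\b(", str_c(words_toy, collapse = "|"), ")\\b")
bounded_hits <- str_detect(descriptions, regex(pattern_toy, ignore_case = TRUE))
```

Here the substring version flags "Research and decoding services" (it contains "sea" and "cod") and misses "Seafood", while the bounded version does the opposite.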
Result: none of the non-targets are in Node_fishery
| Name | mc3_edges_new |
| Number of rows | 3711 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 7 | 213 | 0 | 1493 | 0 |
| target | 0 | 1 | 6 | 27 | 0 | 2887 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The maximum length of source is a whopping 213 characters, most likely because some entries pack several names into a single c(“,”) value.
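# Step 1: pick out rows whose source packs several names into one c(...) string
mc3_edges_new_filtered <- mc3_edges_new %>%
  filter(startsWith(source, "c("))
# Step 2: split each packed source into one row per name
mc3_edges_new_split <- mc3_edges_new_filtered %>%
  separate_rows(source, sep = ", ") %>%
  mutate(source = gsub('^c\\(|"|\\)$', '', source))
# Remove the packed rows from the edge list
mc3_edges_new2 <- mc3_edges_new %>%
  anti_join(mc3_edges_new_filtered)
# Add the rows from step 2 (mc3_edges_new is bound in again here,
# so edges that were never packed appear twice in the next step)
mc3_edges_new2 <- mc3_edges_new2 %>%
  bind_rows(mc3_edges_new, mc3_edges_new_split)
# Count each source-target-type triple and keep those seen more than once
mc3_edges_new_groupby <- mc3_edges_new2 %>%
  group_by(source, target, type) %>%
  summarize(weight = n()) %>%
  filter(weight > 1) %>%
  ungroup()
| Name | mc3_edges_new_groupby |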
| Number of rows | 3703 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 7 | 57 | 0 | 1485 | 0 |
| target | 0 | 1 | 6 | 27 | 0 | 2887 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weight | 0 | 1 | 2 | 0.09 | 2 | 2 | 2 | 2 | 5 | ▇▁▁▁▁ |
[1] "tbl_df" "tbl" "data.frame"
The maximum length of source has dropped from 213 to 57 after removing many of the c(“,”) values.
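As an alternative illustration of the unpacking step, the sketch below pulls the quoted names out of each packed string instead of splitting on ", ", so that company names which themselves contain a comma stay intact, and it counts every edge exactly once. This is a sketch on invented data under the assumption that packed values look like `c("Name 1", "Name 2")`; `edges_toy` and `split_source` are hypothetical names, and this is not the code that produced the tables above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical edge list; the first source packs two names into one string
edges_toy <- tibble(
  source = c('c("Alpha Ltd", "Beta, Gamma and Co")', "Delta Inc"),
  target = c("Jane Doe", "John Roe"),
  type   = c("Beneficial Owner", "Beneficial Owner")
)

# Return the quoted names of a packed value, or the value itself otherwise
split_source <- function(x) {
  if (startsWith(x, "c(")) {
    quoted <- str_extract_all(x, '"[^"]*"')[[1]]
    str_remove_all(quoted, '"')
  } else {
    x
  }
}

# One row per individual source name
edges_split <- edges_toy %>%
  mutate(source = lapply(source, split_source)) %>%
  unnest(cols = source)

# One row per distinct edge, with the number of times it occurs
edge_weights <- edges_split %>%
  count(source, target, type, name = "weight")
```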
# Edge endpoints that have no matching node record
source_missing <- setdiff(mc3_edges_new_groupby$source, mc3_nodes_fishery_new$id)
source_missing_df <- tibble(
  id = source_missing,
  country = rep(NA_character_, length(source_missing)),
  type = rep("Company", length(source_missing)),  # missing nodes default to "Company"
  revenue_omu = rep(NA_real_, length(source_missing)),
  product_services = rep(NA_character_, length(source_missing))
)
target_missing <- setdiff(mc3_edges_new_groupby$target, mc3_nodes_fishery_new$id)
target_missing_df <- tibble(
  id = target_missing,
  country = rep(NA_character_, length(target_missing)),
  type = rep("Company", length(target_missing)),  # missing nodes default to "Company"
  revenue_omu = rep(NA_real_, length(target_missing)),
  product_services = rep(NA_character_, length(target_missing))
)
# Keep only nodes that appear in an edge, then append the missing endpoints
mc3_nodes_fishery_new_filtered <- mc3_nodes_fishery_new %>%
  filter(id %in% c(mc3_edges_new_groupby$source, mc3_edges_new_groupby$target))
mc3_nodes_fishery_new_df <- bind_rows(mc3_nodes_fishery_new_filtered, source_missing_df, target_missing_df)
mc3_nodes_fishery_new_df <- mc3_nodes_fishery_new_df %>%
  mutate(revenue_omu = as.character(revenue_omu))
# Collapse repeated ids into one row each
mc3_nodes_fishery_grouped <- mc3_nodes_fishery_new_df %>%
  group_by(id) %>%
  summarize(
    count = n(),
    type_1 = ifelse(n() >= 1, type[1], NA),
    type_2 = ifelse(n() >= 2, type[2], NA),
    type_3 = ifelse(n() >= 3, type[3], NA),
    country = ifelse(n() == 1, country, paste(unique(country), collapse = ", ")),
    revenue_omu = ifelse(n() == 1, revenue_omu, paste(unique(revenue_omu), collapse = ", ")),
    product_services = ifelse(n() == 1, product_services, paste(unique(product_services), collapse = ", "))
  )
# A tibble: 10 × 2
Products Occurrences
<chr> <int>
1 <NA> 3560
2 Fish and seafood products 37
3 Seafood products 23
4 Canning, processing and manufacturing of seafood and other aquat… 18
5 Fish and fish products 15
6 Seafood 11
7 Fish and sea food products 9
8 Fish and seafoods products 9
9 Fresh and frozen seafood 9
10 Tuna, sword fish, bass, trout, and salmon, as well as offers she… 9
ids_fishing <- mc3_nodes_fishery_grouped %>%
  filter(str_detect(product_services, paste(words_fishery, collapse = "|")) |
           is.na(product_services)) %>%
  pull(id)
mc3_edges_fishery <- mc3_edges_new_groupby %>%
  filter(source %in% ids_fishing)
mc3_nodes_fishery_ <- mc3_nodes_fishery_grouped %>%
  filter(id %in% c(mc3_edges_fishery$source, mc3_edges_fishery$target))
mc3_fish_graph <- tbl_graph(nodes = mc3_nodes_fishery_,
                            edges = mc3_edges_fishery,
                            directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  ggraph(layout = "nicely") +
  scale_edge_width(range = c(0.01, 6)) +
  geom_node_point(aes(colour = type_1,
                      size = betweenness_centrality)) +
  theme_graph() +
  labs(size = "Betweenness Centrality")
mc3_fish_graph
Almost all of the refined nodes have type “Company”, and a few nodes with relatively high betweenness centrality sit near the center.
mc3_edges_fishery_in <- mc3_edges_fishery %>%
  rename(from = source, to = target)
mc3_nodes_fishery_in <- mc3_nodes_fishery_ %>%
  rename(group = type_1)
# Create a visNetwork object with nodes and edges
visNetwork(nodes = mc3_nodes_fishery_in, edges = mc3_edges_fishery_in) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)
# A tibble: 1,324 × 1
# Groups: target [508]
target
<chr>
1 Elizabeth Jones
2 Michael Morrison
3 Amanda Robinson
4 Andrew Taylor
5 Brandon Cruz
6 Michael Thompson
7 Melissa Martin
8 Christopher Ramos
9 Richard Smith
10 Andrew Reed
# ℹ 1,314 more rows
# A tibble: 2,887 × 2
target count
<chr> <int>
1 Michael Johnson 11
2 John Smith 10
3 Brian Smith 8
4 Jennifer Johnson 8
5 Michael Smith 8
6 Richard Smith 8
7 David Smith 7
8 James Brown 7
9 James Smith 7
10 Melissa Brown 7
# ℹ 2,877 more rows
filtered_data <- mc3_edges_fishery %>%
group_by(target) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
select(target, count)
filtered_data %>%
group_by(count) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n) * 100, 2)) %>%
ggplot(aes(x = count, y = n, label = paste0(round(percentage, 2), "%"))) +
geom_bar(stat = "identity", fill = "lightblue", color = "black") +
xlab("Count") +
ylab("n") +
scale_x_continuous(breaks = unique(filtered_data$count)) +
geom_text(position = position_stack(vjust = 0.5))
Nearly 80% of the targets have a count of 1. I decided to apply the Pareto principle and look at the top 20% only.
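The share of targets at each count can also be read off a small summary table instead of the bar labels. The sketch below shows one way to compute it with dplyr on an invented edge list; `edges_toy`, `owner_counts`, and `count_share` are hypothetical names.

```r
library(dplyr)

# Hypothetical edge list: Owner 1 is linked to three companies, the others to one each
edges_toy <- tibble(
  source = c("A", "B", "C", "D", "E", "F"),
  target = c("Owner 1", "Owner 1", "Owner 1", "Owner 2", "Owner 3", "Owner 4")
)

# Number of linked companies per target
owner_counts <- edges_toy %>%
  count(target, name = "n_companies", sort = TRUE)

# How many targets fall at each count, and their share of all targets
count_share <- owner_counts %>%
  count(n_companies, name = "n_owners") %>%
  mutate(share = n_owners / sum(n_owners))
```

In this toy example three of the four targets have a count of 1, so the share at count 1 is 0.75.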
filtered_data2 <- mc3_edges_fishery %>%
group_by(target) %>%
summarise(count = n()) %>%
filter(count > 1) %>%
arrange(desc(count)) %>%
select(target, count)
filtered_data2 %>%
group_by(count) %>%
summarise(n = n()) %>%
mutate(percentage = round(n/sum(n) * 100, 2)) %>%
ggplot(aes(x = count, y = n, label = paste0(round(percentage, 2), "%"))) +
geom_bar(stat = "identity", fill = "lightblue", color = "black") +
xlab("Count") +
ylab("n") +
scale_x_continuous(breaks = unique(filtered_data2$count)) +
geom_text(position = position_stack(vjust = 0.5))
This leaves about 500 targets. As future work, there should be a way to define a cut-off on company ownership, so that a subgraph can focus on owners who hold a relatively high number of companies. Conscious of time, I decided to look at just the top 5 targets by count, starting with Michael Johnson.
Michael Johnson is the sole owner of many smaller entities. The revenue of most of these companies is unknown, hinting that some of them might be paper companies.
The analysis of the top 5 targets confirms that Michael Johnson is not a unique case. The top 5 targets are sole owners of many smaller entities. The revenue of most of the companies they own is unknown, hinting that some of these might be paper companies possibly involved in transshipment.
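The sub-network code for the top targets is not shown on this page. As a rough sketch of how the "mostly unknown revenue" observation could be quantified, the example below joins a made-up edge list to a made-up node table and computes, for the top target, how many companies are linked and what share of them has no revenue recorded. All names and values here are hypothetical.

```r
library(dplyr)

# Hypothetical company-to-owner edges and node attributes
edges_toy <- tibble(
  source = c("A Ltd", "B Ltd", "C Ltd", "D Ltd"),
  target = c("Owner 1", "Owner 1", "Owner 1", "Owner 2")
)
nodes_toy <- tibble(
  id          = c("A Ltd", "B Ltd", "C Ltd", "D Ltd"),
  revenue_omu = c(NA, NA, 5000, 9000)
)

# Target with the most linked companies (top 1 here; the analysis above uses the top 5)
top_owners <- edges_toy %>%
  count(target, name = "n_companies", sort = TRUE) %>%
  slice_head(n = 1) %>%
  pull(target)

# Per top target: number of companies and share with unknown revenue
owner_summary <- edges_toy %>%
  filter(target %in% top_owners) %>%
  left_join(nodes_toy, by = c("source" = "id")) %>%
  group_by(target) %>%
  summarise(
    n_companies = n(),
    share_unknown_revenue = mean(is.na(revenue_omu))
  )
```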
Through data exploration, I was able to observe a few anomalies.
Individuals who have ownership in multiple companies. From analyzing three sub-network graphs, it was observed that these individuals tend to own a combination of large and small firms from various countries. While everything may be legitimate, it would be beneficial for FishEye to thoroughly examine individuals who own companies across borders, particularly when they are the sole owners of smaller entities, as exemplified by ‘Michael Johnson’ and the other top targets. The revenue of most of the companies they own is unknown from the data given, which may warrant scrutiny from authorities and regulatory bodies.
Almost all of the refined nodes have type “Company”; there were no beneficial owners or company contacts. This is expected given the initial distribution of types in the node data frame, but it might be worth investigating further what the id type means for illegal fishing.